Introduction

Preview of the dataset

My primary dataset is the GTZAN Dataset, which spans ten genres with one hundred 30-second audio files per genre. For my analysis, I used the segmented version of the dataset, in which each original audio file is partitioned into 3-second clips; this segmentation allows a more granular examination of the musical features embedded in the recordings. The data were collected in 2000-2001 from a variety of sources, including CDs, radio, and microphone recordings, to account for variation in recording conditions.

Examining Relationships

Findings from Lasso Regression

Lasso regression assigned non-zero coefficients to every feature, suggesting that all of them carry signal for predicting musical genre. Even so, I chose to explore multicollinearity further to improve interpretability and gain deeper insight into the relationships within the data.
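To illustrate why a non-zero lasso coefficient signals a relevant feature: the L1 penalty soft-thresholds weak coefficients exactly to zero. The analysis itself was done in R, but the mechanism can be sketched in a few lines of NumPy on synthetic data (`lasso_cd` is a hypothetical helper using coordinate descent, not the code used in this project):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate-descent lasso: minimise ||y - Xb||^2 / (2n) + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j removed from the current fit
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            # Soft-thresholding drives weak coefficients exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b
```

On data where only the first feature matters, the remaining coefficients are shrunk exactly to zero; in this project's case, every feature survived the shrinkage.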

Correlation Heatmap

During exploratory data analysis, I found strong correlations among several variables. The plot below shows the correlation coefficients between the first 25 numeric variables in the dataset (excluding constant columns).
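The correlation matrix underlying such a heatmap is a single call. A minimal NumPy sketch on synthetic stand-ins for three audio features (the feature names and values here are illustrative, not taken from GTZAN):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
centroid = rng.normal(2000.0, 400.0, n)                 # stand-in: mean spectral centroid
bandwidth = 0.8 * centroid + rng.normal(0.0, 150.0, n)  # correlated with centroid by construction
rolloff = rng.normal(4000.0, 600.0, n)                  # independent of the other two

features = np.column_stack([centroid, bandwidth, rolloff])
corr = np.corrcoef(features, rowvar=False)  # 3x3 matrix of pairwise correlations
```

Passing this matrix to a heatmap function reproduces the kind of plot shown: near-1 values on the diagonal, and a strong off-diagonal entry for the correlated pair.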

Scatter Plot

The scatter plot highlights a strong relationship (r = 0.89) between mean spectral bandwidth and mean spectral centroid across genres. Through interactive filtering, you can refine the view by clicking (and double-clicking) the genre labels on the plot's right side. I filtered results by genre to allow a more focused comparison of genre-specific patterns.

These observations hint at potential redundancy or multicollinearity within the dataset. To address this issue, I explored dimensionality reduction techniques aimed at enhancing robustness and reducing noise.

Dimension Reduction: Principal Component Analysis

Scree plots for variance explained (left) and cumulative variance explained (right)

Capturing 90% of the variance requires 31 principal components, a modest reduction from the original 57 features.
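Counting the components needed for a variance target follows directly from the singular values of the centred data. A minimal NumPy sketch (`n_components_for` is an illustrative helper, run here on synthetic low-rank data rather than the GTZAN features):

```python
import numpy as np

def n_components_for(X, target=0.90):
    # Centre the data, take the SVD, and count how many principal
    # components are needed to reach the target share of total variance
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s ** 2 / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(explained), target) + 1)
```

Applied to the 57-feature GTZAN matrix, the same calculation yields the 31 components reported above.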

PCA Plot

The lack of clear separation observed in the PCA plot suggests that the genres of music are not easily distinguishable within the reduced-dimensional space defined by the principal components. This indicates potential complexity within the data or significant overlap between classes in terms of their features. Given this observation, which implies that the data does not lend itself well to PCA, I will not utilize the reduced dataset to aid in classification.

Linear Discriminant Analysis

Accuracy Distribution

The plot below illustrates the distribution of test accuracies across musical genres obtained using Linear Discriminant Analysis (LDA) from fifty train-test (80-20 split) iterations. The red dashed line marks the overall accuracy of LDA, which was 67.22%.
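For reference, the model behind these accuracies assigns each clip to the class with the highest linear discriminant score (shared covariance, per-class means and priors). The project used R's LDA implementation; this is a hedged NumPy sketch of the same classifier on synthetic two-class data:

```python
import numpy as np

def lda_fit_predict(X_tr, y_tr, X_te):
    # Gaussian LDA: one mean per class, a single pooled covariance matrix
    classes = np.unique(y_tr)
    n, p = X_tr.shape
    means, priors = [], []
    pooled = np.zeros((p, p))
    for c in classes:
        Xc = X_tr[y_tr == c]
        mu = Xc.mean(axis=0)
        means.append(mu)
        priors.append(len(Xc) / n)
        pooled += (Xc - mu).T @ (Xc - mu)
    inv = np.linalg.inv(pooled / (n - len(classes)))
    # Linear discriminant score per class: x' S^-1 mu - mu' S^-1 mu / 2 + log prior
    scores = np.stack(
        [X_te @ inv @ m - 0.5 * m @ inv @ m + np.log(pr)
         for m, pr in zip(means, priors)],
        axis=1,
    )
    return classes[np.argmax(scores, axis=1)]
```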

Preview: Predictions across 50 trials

This dataframe comprises all predictions collected across the fifty train-test iterations. It contains a total of 99,900 observations.

predicted_label   true_label
rock              blues
blues             blues
reggae            blues
blues             blues
reggae            blues
blues             blues

Model-predicted genres for true “blues” labels

Full view: Model-predicted genres for each true genre label

Quadratic Discriminant Analysis

Accuracy Distribution

The plot below illustrates the distribution of test accuracies across musical genres obtained using Quadratic Discriminant Analysis (QDA) from fifty train-test (80-20 split) iterations. The red line displays the overall accuracy, which was 76.81%.
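QDA differs from LDA only in giving each class its own covariance matrix, which lets the decision boundaries curve; that extra flexibility is the likely source of the accuracy gain here (76.81% vs. 67.22%). A hedged NumPy sketch of the classifier, again on synthetic data rather than the project's R pipeline:

```python
import numpy as np

def qda_fit_predict(X_tr, y_tr, X_te):
    # Gaussian QDA: each class keeps its own covariance matrix
    classes = np.unique(y_tr)
    scores = []
    for c in classes:
        Xc = X_tr[y_tr == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        d = X_te - mu
        maha = np.einsum("ij,jk,ik->i", d, inv, d)  # squared Mahalanobis distances
        # Quadratic discriminant score: -log|S_c|/2 - maha/2 + log prior
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(len(Xc) / len(X_tr)))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```

The synthetic test below uses two classes with the same centre but very different spreads, a case QDA separates well and LDA cannot.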

Model-predicted genres for true “blues” labels

Full view: Model-predicted genres for each true genre label

Feedforward Neural Network

Neural Network Architecture

After experimenting with different layer structures and activation functions (e.g., sigmoid instead of ReLU), this architecture yielded the best predictive performance:

  library(keras)

  # 57 audio features in; softmax over the 10 genres out
  nn_mod <- keras_model_sequential() %>% 
    layer_dense(units = 128, activation = "relu", input_shape = c(57)) %>% 
    layer_dense(units = 64, activation = "relu") %>% 
    layer_dense(units = 10, activation = "softmax")

Hyperparameter Optimization

I conducted a grid search across the parameter space defined by the batch sizes (8, 16, 32, 64) and epochs (10, 15, 20, 25) to determine the configuration that minimizes validation loss. A batch size of 16 paired with 15 epochs was the most effective combination.
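The search itself is just an exhaustive sweep over the 16 (batch size, epochs) pairs, keeping the one with the lowest validation loss. A minimal Python sketch of that loop, where `val_loss` is a hypothetical stand-in for a function that trains the network once with the given settings and returns its validation loss:

```python
from itertools import product

def grid_search(val_loss, batch_sizes=(8, 16, 32, 64), epoch_grid=(10, 15, 20, 25)):
    # Evaluate every (batch_size, epochs) pair and keep the one with
    # the lowest validation loss reported by val_loss(batch_size, epochs)
    return min(product(batch_sizes, epoch_grid), key=lambda cfg: val_loss(*cfg))
```

With real training in place of the stand-in, each call to `val_loss` is one full model fit, so the sweep costs 16 training runs.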

Optimized Neural Network Performance

The following plots display the results obtained during the training of the neural network, using the optimal hyperparameters. Ultimately, the trained model demonstrated a test accuracy ranging from 84% to 87%.

K-Nearest Neighbors

Optimizing k

After standardizing the features through centering and scaling, I optimized the k parameter for the K-nearest neighbors (KNN) algorithm. Utilizing 5-fold cross-validation, I evaluated multiple k values to pinpoint the one that minimizes classification error.
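The selection procedure can be sketched compactly: for each candidate k, estimate the misclassification rate by 5-fold cross-validation and keep the k with the lowest error. A hedged NumPy version on synthetic clusters (`knn_predict` and `cv_error` are illustrative helpers, not the R code used here):

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    # Majority vote among the k nearest training points (Euclidean distance)
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    preds = []
    for labels in y_tr[nearest]:
        vals, counts = np.unique(labels, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def cv_error(X, y, k, n_folds=5, seed=0):
    # Mean misclassification rate across the cross-validation folds
    idx = np.random.default_rng(seed).permutation(len(X))
    errs = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        errs.append(np.mean(knn_predict(X[train], y[train], X[fold], k) != y[fold]))
    return float(np.mean(errs))
```

Choosing k is then `min(candidate_ks, key=lambda k: cv_error(X, y, k))`; on the GTZAN features this procedure selected k = 1.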

Testing KNN: 80-20 Split

After partitioning the dataset into training and testing subsets with an 80-20 ratio, I applied the K-nearest neighbors (KNN) algorithm with the optimal value k = 1. This model achieves an accuracy of 90.69%.
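The confusion matrix reported next tallies (true label, predicted label) pairs; accuracy is the trace divided by the total count. A minimal sketch of that tally (I assume rows index the predicted label and columns the true label, matching the tables in this report):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    # Rows index the predicted label, columns the true label
    pos = {lab: i for i, lab in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[pos[p], pos[t]] += 1
    return m
```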

Confusion Matrix
           blues  classical  country  disco  hiphop  jazz  metal  pop  reggae  rock
blues        178          1        4      3       0     5      1    0       2     3
classical      0        171        3      2       1    10      0    0       0     0
country        5          4      168      2       2     2      0    7       3     5
disco          4          1        0    191       1     0      0    7       3     7
hiphop         0          0        0      0     186     0      0    3       0     3
jazz           3          6        4      1       0   173      0    1       0     2
metal          1          0        1      3       1     0    200    0       0     1
pop            0          0        0      2       1     1      0  164       2     4
reggae         9          0        5      2       4     1      1    3     197     5
rock           3          2       11      3       0     2      3    3       1   184

Overall Metrics
                  Value
Accuracy         0.9069
Kappa            0.8965
AccuracyLower    0.8933
AccuracyUpper    0.9193
AccuracyNull     0.1071
AccuracyPValue   0.0000
McnemarPValue       NaN

Testing KNN: Leave-One-Out Cross-Validation

I conducted Leave-One-Out Cross-Validation (LOOCV) employing the KNN algorithm with the optimal k value of 1 on the dataset. The resulting accuracy was 92.22%. Notably, LOOCV offers a more thorough validation approach compared to traditional 80-20 splits, as it evaluates the model’s performance on every single data point, providing a comprehensive assessment of its generalization capability.
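For 1-NN, LOOCV is especially cheap: no refitting is needed, since leaving a point out just means classifying it by its nearest neighbour among all the other points. A hedged NumPy sketch of that shortcut on synthetic clusters (the project ran LOOCV in R):

```python
import numpy as np

def loocv_1nn_accuracy(X, y):
    # Leave-one-out for 1-NN: each point is classified by its nearest
    # neighbour among all the *other* points, so no model refitting is needed
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point must not vote for itself
    return float(np.mean(y[d.argmin(axis=1)] == y))
```

Note that the full pairwise-distance matrix is O(n^2) in memory, which is fine at GTZAN's scale (roughly 10,000 clips) but worth keeping in mind for larger datasets.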

Confusion Matrix
           blues  classical  country  disco  hiphop  jazz  metal  pop  reggae  rock
blues        945          4       28      1       1    15      5    0       6     8
classical      1        933        7      6       2    46      1    0       1     5
country       15          8      841     15       5    17      0   17      12    26
disco          3          1       17    940       9     4      2   35       6    38
hiphop         4          1        8      5     946     2      3   17       5     5
jazz           8         41       20      4       1   902      1    5       3     8
metal          0          1        2      2       4     0    968    1       0    14
pop            1          1       11      6      12     3      0  904       6     7
reggae        14          0       36      4      16     4      1   14     956     9
rock           9          8       27     16       2     7     19    7       5   878

Overall Metrics
                  Value
Accuracy         0.9222
Kappa            0.9136
AccuracyLower    0.9168
AccuracyUpper    0.9274
AccuracyNull     0.1001
AccuracyPValue   0.0000
McnemarPValue    0.0001

Conclusion

Comparison of Classification Techniques

After conducting classification experiments on musical genres using LDA, QDA, a neural network, and KNN, I found that KNN with k = 1 yields the highest accuracy, 92.22%. While musical genres may be somewhat subjective, this model excels at recognizing the intricate patterns embedded in the audio features associated with each genre. This underscores the effectiveness of KNN in genre classification tasks and offers insight into the discriminative power of audio features across diverse musical genres.